1 Introduction

Public companies, domestic and foreign, are required to submit filings to the U.S. Securities and Exchange Commission (SEC) that contain financial and operational information in various forms and for various periods, such as quarterly reports (10-Q) and annual reports (10-K). The filing process is managed through EDGAR, the Electronic Data Gathering, Analysis, and Retrieval system. This benefits investors, companies, and the U.S. economy by enhancing the transparency, fairness, and efficiency of securities markets.

Furthermore, the annual report (Form 10-K) gives companies an opportunity to present their business information, risk analysis, financial data, and management’s discussion and analysis (MD&A), the only part in which companies may discuss business performance without conforming to the Generally Accepted Accounting Principles (GAAP). This can produce a positive bias, as companies may deliberately use positive words in the discussion part to avoid a negative effect on the stock price. In fact, [@Taylor] found that the sentiment diminishes after the non-GAAP texts are removed. Consequently, this raises the questions of whether the analysis of the MD&A part can provide meaningful insights to investors, and what the effect of the text sentiment per se is.

Therefore, the purpose of this report is to extract insights from the MD&A part of the 10-Ks of S&P 500 companies in the Information Technology sector (GICS) and analyze them to predict the stock price using a content-based approach, i.e., text mining. This approach was pioneered by [@Loughran], who conducted a textual analysis of 10-Ks and their effect on stock returns, particularly in the financial context, and eventually developed a dictionary for financial textual analysis. Furthermore, [@Chouliaras] concluded that 10-K negative polarity affected the stock price on a monthly basis after filing. In particular, [@Ahern] provided evidence that firms that manipulate media coverage obtain higher stock prices, which signifies the impact of textual sentiment on the stock price.

The rest of this report is organized as follows. Section 2 provides the overall design of the textual analysis, i.e., the Natural Language Processing (NLP) pipelines. Section 3 presents how the corpus is constructed and provides general insights using bag-of-words and tidy text approaches [@Silge]. Section 4 describes how text features such as polarity and formality are extracted, and their effect on the stock price. Lastly, section 5 presents the analysis based on topic modelling: how topics are retrieved, and the analysis of the marginal effect of the topics on the stock price.


2 Textual Analysis Design

2.1 Analysis Approach

The data is collected from https://www.sec.gov/edgar.shtml, from which the 10-K forms, specifically the MD&A sections, are retrieved for 30 companies from 2010 to 2020, depending on availability on EDGAR. Further explanation of the data sampling and company selection criteria is provided in section 3. Natural Language Processing pipelines are then created, beginning with text cleaning and metadata retrieval. Next, lemmatization and POS tagging are applied to the text data to reduce word duplication and enhance semantic coherence. The last part of section 3 provides a bag-of-words analysis using the TF-IDF approach for context understanding.
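Since the TF-IDF metric recurs throughout the analysis, it helps to recall that the score of a term t in a document d is tf(t, d) × idf(t), with idf(t) = ln(N / n_t) over N documents. A minimal toy illustration, assuming the dplyr and tidytext packages (the two-document corpus here is hypothetical):

```r
library(dplyr)
library(tidytext)

# hypothetical two-document toy corpus, one row per (doc, word) count
toy <- tibble(
  doc  = c("d1", "d1", "d2"),
  word = c("revenue", "growth", "revenue")
) %>%
  count(doc, word)

toy %>% bind_tf_idf(word, doc, n)
# "revenue" occurs in both documents, so its idf (and tf-idf) is 0;
# "growth" occurs only in d1, so it receives a positive tf-idf
```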

Moreover, the sentiment analysis in section 4 presents how important features are retrieved from the text data, namely polarity, formality, reading level, and stock price. The analysis in this part discusses the proper dictionary to use in the financial context; a descriptive analysis of the relationship between the stock price and the other features; and the effect of these features on the stock price, estimated by regression.
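As a sketch of the polarity feature, a finance-appropriate dictionary such as the Loughran-McDonald lexicon can be applied to tidy tokens (assumptions: the tidytext and textdata packages are installed, and `token_df` is a hypothetical one-token-per-row frame with `doc_id` and `word` columns):

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Loughran-McDonald finance lexicon (downloaded on first use via textdata)
lm_lexicon <- get_sentiments("loughran")

polarity_df <- token_df %>%                       # hypothetical tidy tokens
  inner_join(lm_lexicon, by = "word") %>%
  count(doc_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # net polarity per document, scaled by the total sentiment word count
  mutate(polarity = (positive - negative) / (positive + negative))
```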

Lastly, in section 5, the text data is processed and cleaned to construct the corpus used in topic modelling, following an approach similar to [@Schmiedel], who demonstrated the advantages of structural topic modelling (STM). This section discusses the optimal number of topics and topic evaluation, including the relationship of features to topics using regression analysis. The approach is summarized in Figure 1: Summary of analysis approach.
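The STM workflow used in section 5 can be sketched as follows (assumptions: the stm package is installed; `mdna_df` is a hypothetical frame holding the cleaned MD&A texts with an `industry` covariate; K = 10 is a placeholder, as the optimal number of topics is discussed in section 5):

```r
library(stm)

# hypothetical frame 'mdna_df' with cleaned MD&A texts and covariates
processed <- textProcessor(documents = mdna_df$text, metadata = mdna_df)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# K = 10 is a placeholder for the number of topics
fit <- stm(documents = out$documents, vocab = out$vocab,
           K = 10, prevalence = ~ industry, data = out$meta)

# marginal effect of the covariate on topic prevalence
eff <- estimateEffect(1:10 ~ industry, fit, metadata = out$meta)
```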

Figure 1 Summary of analysis approach


3 Corpus Construction

This section presents how the data is downloaded from EDGAR, specifically the MD&A part of the 10-K forms, and parsed into a data frame. It also illustrates the pipeline, which starts with data cleaning, lemmatization, and POS tagging. After that, the analysis splits into two routes, unigrams and bigrams, showing how the tokenization process is implemented and the insights obtained with the TF-IDF approach. The approach is summarized in Figure 2: Summary of corpus construction approach below.

Figure 2 Summary of corpus construction approach

3.1 Data import

The purpose of this project is to quantitatively predict the impact of sentiment and topics on the change in stock price. Therefore, the data for this analysis is collected from https://www.sec.gov/edgar.shtml, where 10-K forms of 30 companies from the S&P 500, specifically in the Information Technology sector (GICS), are randomly selected. An additional selection criterion is the number of companies in each sub-industry: the top 5 sub-industries with the highest number of companies are chosen, namely Semiconductors (Semi Con); Data Processing & Outsourced Services (DP); Application Software (App); Technology Hardware, Storage & Peripherals (Tech Hardware); and IT Consulting & Other Services (IT Consult). The summary is presented in Figure 3: Summary of the number of companies in IT sector below.

Figure 3 Summary of the number of companies in IT sector

For each sub-industry, companies are randomly selected, and the data, specifically the MD&A part, is collected from 2010 to 2020 where available. The list of companies, sub-industries, and sample sizes is summarized in Figure 4: Sample size of each company across sub industries below.

Figure 4 Sample size of each company across sub industries

# set up a user for EDGAR to prevent errors of using automated tools
Sys.setenv("User-Agent"="EDGARWEBR_USER_AGENT")
# create lists of companies across sub industries using CIK numbers
app_software = c("796343", "883241", "896878", "1108524", "1341439", "1013462")
dp_outsource = c("1175454","723531","1101215","1123360","1141391","1365135","1403161","1136893")
it_con = c("749251","1058290","1467373","1336920")
semi_con = c("2488","50863","743316","804328","827054","1045810","4127","1730168")
tech_hw = c("108772","320193","106040","1137789")

cik_list = c(app_software, dp_outsource, it_con, semi_con, tech_hw)
# retrieve indexes and metadata across companies in the cik list
edgar::getMasterIndex(2010:2020)
indexes = data.frame()

index_files = list.files("Master Indexes/", pattern="Rda")

for(i in index_files){
  load(paste0("Master Indexes/",i))
  this_index <- year.master 
  indexes <- bind_rows(indexes,this_index)
  print(i)
}
# to cross check available data
indexes_2 = indexes %>% filter(cik %in% cik_list, form.type == "10-K")
# create system sleep function
testit <- function(x){
    p1 <- proc.time()
    Sys.sleep(x)
    proc.time() - p1 
}
  • Financial reports in 10-K forms are retrieved for all companies in the list from 2010 to 2020.
for(i in cik_list){
  print(i)

  for(y in c(2010:2020)){
  print(y)
  getFilings(cik.no = i, filing.year = y,form.type = "10-K")
  testit(10)
  }
}
  • Let’s adjust the function getMgmtDisc() from the edgar package.
# create a function to retrieve only management discussion part 
getMgmtDisc_2 <- function(cik.no, filing.year) {
    
    f.type <- c("10-K", "10-K405","10KSB", "10KSB40")
    # 10-K, 10-K405, 10-KSB, 10-KT, 10KSB, 10KSB40, and 10KT405 filings in the EDGAR database

    # check the year validity
    if (!is.numeric(filing.year)) {
        cat("Please check the input year.")
        return()
    }
    
    output <- getFilings(cik.no = cik.no, form.type = f.type , filing.year, 
                         quarter = c(1, 2, 3, 4), downl.permit = "y")
    
    if (is.null(output)){
      cat("No annual statements found for given CIK(s) and year(s).")
      return()
    }
    
    cat("Extracting 'Item 7' section...\n")
    
    progress.bar <- txtProgressBar(min = 0, max = nrow(output), style = 3)
    
    # function for text cleaning
    CleanFiling2 <- function(text) {
      
      text <- gsub("[[:digit:]]+", "", text)  ## remove digits
      
      text <- gsub("\\s{1,}", " ", text)
      
      text <- gsub('\"',"", text)
      
      #text <- RemoveStopWordsFilings(text)
      
      return(text)
    }

    new.dir <- paste0("MD&A section text")
    dir.create(new.dir)
    
    output$extract.status <- 0
    
    output$company.name <- toupper(as.character(output$company.name))
    output$company.name <- gsub("\\s{2,}", " ",output$company.name)
    
    for (i in 1:nrow(output)) {
        f.type <- gsub("/", "", output$form.type[i])
        cname <- gsub("\\s{2,}", " ",output$company.name[i])
        year <- output$filing.year[i]
        cik <- output$cik[i]
        date.filed <- output$date.filed[i]
        accession.number <- output$accession.number[i]
        
        dest.filename <- paste0("Edgar filings_full text/Form ", f.type, 
                                "/", cik, "/", cik, "_", f.type, "_", 
                                date.filed, "_", accession.number, ".txt")
        # read filing
        filing.text <- readLines(dest.filename)
        
        # extract data from first <DOCUMENT> to </DOCUMENT>
        tryCatch({
          filing.text <- filing.text[(grep("<DOCUMENT>", filing.text, ignore.case = TRUE)[1]):(grep("</DOCUMENT>", 
                                                                                                    filing.text, ignore.case = TRUE)[1])]
        }, error = function(e) {
          filing.text <- filing.text ## if opening/closing DOCUMENT tags are not found, consider the full web page
        })
        
        # See if 10-K is in XLBR or old text format
        if (any(grepl(pattern = "<xml>|<type>xml|<html>|10k.htm", filing.text, ignore.case = T))) {
            
            doc <- XML::htmlParse(filing.text, asText = TRUE, useInternalNodes = TRUE, addFinalizer = FALSE)
            
            f.text <- XML::xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", 
                XML::xmlValue)
            
            f.text <- iconv(f.text, "latin1", "ASCII", sub = " ")
            
                  ## free up htmlParse document to avoid memory leakage, this calls C function
            #.Call('RS_XML_forceFreeDoc', doc, package= 'XML')
            
        } else {
            f.text <- filing.text
        }
        
        # preprocessing the filing text
        f.text <- gsub("\\n|\\t|$", " ", f.text)
        f.text <- gsub("^\\s{1,}", "", f.text)
        f.text <- gsub(" s ", " ", f.text)
        
        # check for empty lines and delete them
        empty.lnumbers <- grep("^\\s*$", f.text)
        
        if (length(empty.lnumbers) > 0) {
            f.text <- f.text[-empty.lnumbers]  ## remove all lines only with space
        }
        
        
        # get MD&A sections
        startline <- grep("^Item\\s{0,}7[^A]", f.text, ignore.case = TRUE)
        endline <- grep("^Item\\s{0,}7A", f.text, ignore.case = TRUE)
        
        # if there is no Item 7A, take up to Item 8
        if (length(endline) == 0) {
            endline <- grep("^Item\\s{0,}8", f.text, ignore.case = TRUE)
        }
        
        md.dicusssion <- NA
        
        if (length(startline) != 0 && length(endline) != 0) {
            
            startline <- startline[length(startline)]
            endline <- endline[length(endline)] - 1
            
            md.dicusssion <- paste(f.text[startline:endline], collapse = " ")
            md.dicusssion <- gsub("\\s{2,}", " ", md.dicusssion)
            #words.count <- str_count(md.dicusssion, pattern = "\\S+")
            
            #md.dicusssion <- gsub(" co\\.| inc\\.| ltd\\.| llc\\.| comp\\.", " ", md.dicusssion, ignore.case = T)
            
            #md.dicusssion2 <- unlist(strsplit(md.dicusssion, "\\. "))
            #md.dicusssion2 <- paste0(md.dicusssion2, ".")
            #md.dicusssion <- CleanFiling2(md.dicusssion)
            header <- paste0("CIK: ", cik, "\n", "Company Name: ", cname, "\n", 
                             "Form Type : ", f.type, "\n", "Filing Date: ", date.filed, "\n",
                             "Accession Number: ", accession.number)  
            md.dicusssion <- paste0(header, "\n\n\n", md.dicusssion)
            
            
        }
        
        if( (!is.na(md.dicusssion)) ){
          filename2 <- paste0(new.dir, '/',cik, "_", f.type, "_", date.filed, 
                              "_", accession.number, ".txt")
          
          writeLines(md.dicusssion, filename2)
          output$extract.status[i] <- 1
        }
        
        # update progress bar
        setTxtProgressBar(progress.bar, i)
    }
    
    ## convert dates into R dates
    output$date.filed <- as.Date(as.character(output$date.filed), "%Y-%m-%d")

    # close progress bar
    close(progress.bar)
    
    output$quarter <- NULL
    output$filing.year <- NULL
    names(output)[names(output) == 'status'] <- 'downld.status'
    
    cat("MD&A section texts are stored in 'MD&A section text' directory.")
    
    return(output)
}
# download management discussion data
df = data.frame()

for(i in as.numeric(cik_list)){
  print(i)

  for(y in 2010:2020){
  print(y)
  data = getMgmtDisc_2(cik.no = i, filing.year = y)
  df = rbind(df, data)
  testit(10)
  }
}
# management discussion data frame
text_df = data.frame()

txt_files = list.files("MD&A section text/", pattern="txt")

for(i in 1:length(txt_files)){
  test = readLines(paste0("MD&A section text/",txt_files[i]))
  
  cik = test[1]
  comp_name = test[2]
  form_type = test[3]
  filing_date = test[4]
  access_no = test[5]
  text = test[8]
  
  rows <- data.frame(cik,comp_name,form_type,filing_date,access_no,text)
  text_df <- bind_rows(text_df,rows)
  
  print(i)
}

3.2 Data cleaning

  • The output from the above is cleaned by applying the following steps:
    • Converting text to lower case
    • Removing non-UTF-8 characters, punctuation marks, emojis, numbers, and line breaks
# create a function extracting only numbers
numextract <- function(string){ 
  str_extract(string, "\\-*\\d+\\.*\\d*")
} 
# create a function for text cleaning
cleanse_text = function(c) {
  c = tolower(c)
  c = iconv(c)
  c = gsub("[[:punct:][:blank:]]+", " ", c)
  c = stringr::str_replace_all(c, "\r", "")
  c = trimws(c)
  
  # remove mentions, urls, emojis, numbers, punctuations, etc.
  c <- gsub("@\\w+", "", c)
  c <- gsub("https?://.+", "", c)
  c <- gsub("\\d+\\w*\\d*", "", c)
  c <- gsub("#\\w+", "", c)
  c <- gsub("[^\x01-\x7F]", "", c)
  
  # remove spaces and newlines
  c <- gsub("\n", " ", c)
  c <- gsub("^\\s+", "", c)
  c <- gsub("\\s+$", "", c)
  c <- gsub("[ |\t]+", " ", c)
  
  return(c)
}
text_df$cik = as.numeric(str_extract(text_df$cik,pattern="[0-9]+"))
text_df$comp_name = sub(".*:", "", text_df$comp_name)
text_df$comp_name = trimws(gsub('"', "", text_df$comp_name))
text_df$form_type = sub(".*:", "", text_df$form_type)
text_df$filing_date = as.Date(sub(".*:", "", text_df$filing_date))
text_df$access_no = numextract(text_df$access_no)
text_df$text = cleanse_text(text_df$text)
# read the ticker file mapping CIK codes to tickers
ticker = read.csv("ticker.txt",sep = " ", header = FALSE)
ticker$code = as.numeric(numextract(ticker$V1))
ticker$comp = sub("\\s+\\S+$", '', ticker$V1)
ticker = ticker[-1]
# join the data with the previous data frame
text_df2 = text_df %>% inner_join(ticker, by = c("cik" = "code"))

# feature engineering: create new columns: pre-filing and post-filing dates
text_df2$date_before <- text_df2$filing_date - 7
text_df2$date_after <- text_df2$filing_date + 5
text_df2$date_after_wd <- with(text_df2,  date_after - setNames(rep(0:2, c(5, 1, 1)), 1:7)[format(date_after, "%u")]) # force the post-filing date to be only weekdays

# after checking the data, the company code with CIK 1730168 is duplicate (avgo vs avgop)
# hence, it is removed
text_df2 = text_df2 %>% filter(!comp == "avgop")
# save the file
saveRDS(text_df2, file = "text_df2.rds")
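The weekday adjustment above maps ISO weekdays 1-5 (Mon-Fri) to an offset of 0 and rolls Saturday and Sunday back by 1 and 2 days respectively, so every post-filing date lands on a trading weekday. A quick check on toy dates:

```r
# named offset vector: weekdays "1".."5" -> 0, "6" -> 1, "7" -> 2
offset <- setNames(rep(0:2, c(5, 1, 1)), 1:7)

d <- as.Date(c("2021-01-08", "2021-01-09", "2021-01-10"))  # Fri, Sat, Sun
d - offset[format(d, "%u")]
# all three dates collapse to the Friday, "2021-01-08"
```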

3.3 Lemmatization

  • Let’s load the model using udpipe library.
langmodel_download <- udpipe::udpipe_download_model("english")
langmodel <- udpipe::udpipe_load_model(langmodel_download$file_model)
  • Let’s do annotation.
postagged <- udpipe_annotate(langmodel,
                             text_df2$text,
                             parallel.cores = 10, 
                             trace = 100)
  • We can create a data frame and explore it
# create a data frame of pos-tagged texts
postagged <- as.data.frame(postagged)
head(postagged)

# save the working file as .rds
saveRDS(postagged,file = "postagged.rds")
  • Let’s filter and detokenize the postagged terms.
# read data
postagged = readRDS("postagged.rds")

# filter only nouns, adj, and adv
lematized <- postagged %>% filter(upos %in% c("NOUN",
                                              "ADJ",
                                              "ADV")) %>% select(doc_id,lemma) %>% group_by(doc_id) %>% summarise(documents_pos_tagged = paste(lemma,collapse = " "))
## `summarise()` ungrouping output (override with `.groups` argument)
# create unique id
text_df2 <- text_df2 %>% mutate(doc_id = paste0("doc",row_number()))

# combine tables
text_df3 <- text_df2 %>% left_join(lematized)
## Joining, by = "doc_id"
# save the working file as .rds
saveRDS(text_df3,file = "text_df3.rds")

# check NA values across columns
which(colSums(is.na(text_df3)) > 0)
## documents_pos_tagged 
##                   12

3.4 Unigram

3.4.1 Tokenization

# word tokenization
token_df_all <- text_df3 %>% unnest_tokens(word,documents_pos_tagged)

# check the language and remove non-English words
token_df_all$lan <- cld3::detect_language(token_df_all$word)
token_df_all <- token_df_all %>% filter(lan=="en") 

# check the language using different approach: hunspell
spell_check = hunspell_check(token_df_all$word)
token_df_all = token_df_all[spell_check,]
  • We next back up the file.
token_df_all = token_df_all %>% select(-text)
saveRDS(token_df_all,file = "token_df_all.rds")
  • Let’s inspect text length.
# read data
token_df_all = readRDS("token_df_all.rds")

# calculate the token length 
token_df_all$token_length <- nchar(token_df_all$word)

# let's have a look on the distribution 
token_df_all %>% group_by(token_length) %>% summarise(total =n()) %>% ggplot(aes(x=token_length, y=total)) + geom_col() + labs(x="Token length", y="Frequency", subtitle="Distribution of Token Length")
## `summarise()` ungrouping output (override with `.groups` argument)

# let's have a look at the distribution of tokens again 
token_df_all %>% group_by(token_length) %>% 
  summarise(total =n()) %>% 
  arrange(desc(token_length))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 16 x 2
##    token_length total
##           <int> <int>
##  1           18     2
##  2           17     7
##  3           15    32
##  4           14   214
##  5           13  2241
##  6           12  1646
##  7           11  9274
##  8           10 14334
##  9            9  8488
## 10            8 27454
## 11            7 35013
## 12            6 27779
## 13            5 22522
## 14            4 24541
## 15            3  4331
## 16            2   491
  • We can see that the token length distribution is slightly right skewed, with a median of approximately 6-8 characters. – We will remove outliers using the IQR approach, following [@Hubert].
outliers1 <- boxplot(token_df_all$token_length, ylab = "Token length")$out

# drop the rows containing outliers
token_df_all <- token_df_all[!token_df_all$token_length %in% outliers1,]
# inspect top frequent words
token_df_all %>% count(word, sort=T)

# remove null values in words
token_df_all = token_df_all[!is.na(token_df_all$word),]
  • Let’s remove stop words using the stopwords package [@Benoit].
# load stop words
data("stop_words")
token_df2 = token_df_all %>% anti_join(stop_words, by = "word")

token_df2 %>% count(word, sort=T)
  • Let’s add custom stop words, which are the top 50 words from the inspection above.
# generate new stop words
mystopwords <- tibble(word = c(
  "fiscal", "other", "related", "business", "rate", "product", "note",
  "payment", "table", "contract", "foreign", "currency", "such", "charge",
  "time", "purchase", "end", "research", "support", "policy", "available",
  "party", "lease", "requirement", "recognition", "most", "significant",
  "proceed", "solutions", "excess", "issuance", "connection", "government",
  "day", "act", "strategy", "state", "difference", "march", "country",
  "experience", "consulting", "action", "page", "NA", "revenue", "income",
  "tax", "cash", "cost", "net", "financial", "asset", "expense", "result",
  "total", "share", "service", "company", "increase", "management", "sale"))

# combine data sets
token_df2 = token_df2 %>% anti_join(mystopwords, by = "word")

# explore top words across corpus
token_df2 %>% group_by(word) %>% 
  summarise(total =n()) %>% 
  arrange(desc(total)) %>% 
  top_n(10) 
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by total
## # A tibble: 10 x 2
##    word       total
##    <chr>      <int>
##  1 intangible  2789
##  2 potential    991
##  3 applicable   814
##  4 materially   760
##  5 commercial   619
##  6 comparable   609
##  7 channel      607
##  8 ongoing      596
##  9 collection   586
## 10 delivery     576
  • As a result, the top frequent words are more diverse, and more adjectives and adverbs are observed.
# save file
saveRDS(token_df2,file = "token_df2.rds")

3.4.2 TF-IDF

3.4.2.1 Document level

  • To gain insight into how important words are, we use the TF-IDF metric to measure word importance in our documents (corpus). – Let’s calculate TF-IDF using tidy data principles. [https://www.tidytextmining.com/tfidf.html]
listing_words = token_df2 %>% count(doc_id, word, sort=T)

total_words <- listing_words %>% 
  group_by(doc_id) %>% 
  summarize(total = sum(n))
## `summarise()` ungrouping output (override with `.groups` argument)
listing_words <- listing_words %>% 
  left_join(total_words)
## Joining, by = "doc_id"
listing_tf_idf <- listing_words %>% bind_tf_idf(word,doc_id,n) %>% select(-total) 
  • Next, we will inspect the data using the TF-IDF scores.
  • We can see that most words across the MD&A texts have a TF-IDF score below 0.3.
# plot a histogram
hist(listing_tf_idf$tf_idf,breaks = 80,main="TF-IDF plot", xlim = c(0,1))

  • Let’s zoom in on the data where TF-IDF <= 0.2.
listing_tf_idf <- listing_tf_idf %>% 
  filter(tf_idf<=0.2)

# plot a histogram
hist(listing_tf_idf$tf_idf,breaks = 80,main="TF-IDF plot", xlim = c(0,0.2))

  • We can see that common words occur most where TF-IDF is around 0.01-0.02.
listing_tf_idf_2 <- listing_tf_idf %>% 
  filter(tf_idf<=0.2)

# statistical summary
summary(listing_tf_idf_2$tf_idf)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0002382 0.0040371 0.0077939 0.0115862 0.0141520 0.1993487
  • The summary above shows that the data is right skewed, and hence we need to remove the outliers in the next step.
  • Before that, however, let’s check the data based on the TF-IDF scores.
# let's explore the top 10 words with the highest tf-idf
listing_tf_idf_2 %>% 
  group_by(word) %>% 
  arrange(desc(tf_idf)) %>% 
  top_n(10)
## Selecting by tf_idf
## # A tibble: 3,731 x 6
## # Groups:   word [577]
##    doc_id word            n     tf   idf tf_idf
##    <chr>  <chr>       <int>  <dbl> <dbl>  <dbl>
##  1 doc52  absolute       11 0.0780  2.56  0.199
##  2 doc99  contraction    18 0.0769  2.49  0.192
##  3 doc67  spin           10 0.0763  2.49  0.190
##  4 doc83  discover        9 0.0570  3.25  0.185
##  5 doc18  inference       3 0.0395  4.63  0.183
##  6 doc100 contraction    19 0.0706  2.49  0.176
##  7 doc129 booking        12 0.0764  2.28  0.175
##  8 doc101 contraction    20 0.0692  2.49  0.173
##  9 doc85  acquirer       12 0.0732  2.33  0.171
## 10 doc128 booking        12 0.0745  2.28  0.170
## # ... with 3,721 more rows
  • The words with the highest TF-IDF are mostly nouns, which make the context relatively difficult to understand compared with adjectives and adverbs.
# let's explore the top 10 words with the lowest tf-idf
listing_tf_idf_2 %>% 
  group_by(word) %>% 
  arrange(tf_idf) %>% 
  top_n(10) 
## Selecting by tf_idf
## # A tibble: 3,731 x 6
## # Groups:   word [577]
##    doc_id word            n     tf    idf  tf_idf
##    <chr>  <chr>       <int>  <dbl>  <dbl>   <dbl>
##  1 doc15  conjunction     2 0.0198 0.157  0.00312
##  2 doc91  conjunction     2 0.0213 0.157  0.00335
##  3 doc1   conjunction     3 0.0224 0.157  0.00352
##  4 doc175 materially      3 0.0469 0.0756 0.00354
##  5 doc70  conjunction     3 0.0227 0.157  0.00358
##  6 doc170 sufficient      2 0.025  0.146  0.00365
##  7 doc148 sufficient      2 0.0253 0.146  0.00370
##  8 doc149 sufficient      2 0.0256 0.146  0.00375
##  9 doc185 materially     11 0.0505 0.0756 0.00381
## 10 doc81  sufficient      4 0.0261 0.146  0.00382
## # ... with 3,721 more rows
  • Rare words, i.e., low-frequency words, do not provide much insight.
  • Let’s remove outliers.
outliers1.1 <- boxplot(listing_tf_idf_2$tf_idf, ylab = "TF-IDF")$out

summary(listing_tf_idf_2$tf_idf)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0002382 0.0040371 0.0077939 0.0115862 0.0141520 0.1993487
# drop the rows containing outliers
listing_tf_idf_3 <- listing_tf_idf_2 %>% filter(tf_idf <= 0.1 & tf_idf >= 0.0002)


listing_tf_idf_3 %>% 
    group_by(word) %>% 
    arrange(desc(tf_idf)) %>% 
    top_n(10)
## Selecting by tf_idf
## # A tibble: 3,714 x 6
## # Groups:   word [577]
##    doc_id word             n     tf   idf tf_idf
##    <chr>  <chr>        <int>  <dbl> <dbl>  <dbl>
##  1 doc171 discontinued     5 0.0617 1.61  0.0996
##  2 doc82  limited         13 0.129  0.774 0.0996
##  3 doc146 unspecified      7 0.0530 1.86  0.0987
##  4 doc13  game             9 0.0446 2.19  0.0977
##  5 doc123 converted       14 0.0287 3.38  0.0972
##  6 doc30  petition         4 0.0284 3.38  0.0959
##  7 doc181 conference       3 0.0333 2.84  0.0948
##  8 doc102 unfunded         8 0.0491 1.93  0.0946
##  9 doc145 unspecified      7 0.0504 1.86  0.0938
## 10 doc188 stable          11 0.0512 1.83  0.0937
## # ... with 3,704 more rows

3.4.2.2 Industry level

  • In this part, we keep only the tokens retained after the previous step, i.e., after rare and common words were removed based on TF-IDF scores.
token_df3 = token_df2 %>% semi_join(listing_tf_idf_3, by = c("doc_id","word"))

# save file
saveRDS(token_df3,file = "token_df3.rds")
  • Let’s explore word frequency by industry.
token_df3 = token_df3 %>%
  mutate(industry=case_when(
    cik %in% as.numeric(dp_outsource) ~ "Data Processing & Outsourced Services",
    cik %in% as.numeric(app_software) ~ "Application Software",
    cik %in% as.numeric(it_con) ~ "IT Consulting & Other Services",
    cik %in% as.numeric(semi_con) ~ "Semiconductors",
    cik %in% as.numeric(tech_hw) ~ "Technology Hardware, Storage & Peripherals"
  )) %>%
  mutate(industry=industry %>% as.factor())
# word counts
industry_words <- token_df3 %>%
count(industry, word, sort = TRUE)

# total word counts
total_industry_words <- industry_words %>%
group_by(industry) %>%
summarize(total = sum(n))

# left join
industry_words <- industry_words %>%
left_join(total_industry_words)
# visualization of the frequency
industry_words %>%
  mutate(tf = n/total) %>%
  ggplot(aes(x=tf,fill=industry)) +
  geom_histogram(show.legend = FALSE)+
  facet_wrap(~industry,ncol=3,scales = "free_y")

  • From the figure above, we can see that words with low frequency are the most common across sub-industries.

  • Let’s apply Zipf’s law

industry_words %>%
  group_by(industry) %>%
  mutate(rank = row_number(),
  tf = n/total) %>%
  ungroup() -> zipf_data

# plot using Zipf's law
zipf_data %>%
  ggplot(aes(rank, tf, color = industry)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = TRUE) +
  scale_x_log10() +
  scale_y_log10()

  • Notice the deviation among sub-industries at low ranks, which is uncommon: it signifies that the MD&A discussions use a lower percentage of the most common words than many collections of language.
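The slope of the middle section of the rank-frequency curve can be estimated by fitting a linear model on the log-log scale (a sketch using the `zipf_data` frame built above; the rank window 10-500 is an arbitrary choice):

```r
library(dplyr)

# middle section of the rank range, away from both extremes
rank_subset <- zipf_data %>%
  filter(rank < 500, rank > 10)

# a corpus following classic Zipf's law would give a slope close to -1
lm(log10(tf) ~ log10(rank), data = rank_subset)
```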

  • Inspect the distribution of TF-IDF values.

# calculating TF-IDF scores
industry_tf_idf <- industry_words %>%
  bind_tf_idf(word,industry,n) %>%
  select(-total)

options(scipen = 1000)

# tf-idf frequency
industry_tf_idf %>% ggplot(aes(tf_idf)) + geom_histogram(bins = 50)

# check the statistical summary
summary(industry_tf_idf$tf_idf)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## 0.00000000 0.00000000 0.00005751 0.00029156 0.00034059 0.01073434
  • The data is highly right skewed – common words receive low TF-IDF values.
  • Visualization
# create own function to wrap texts
swr = function(string, nwrap=20) {
  paste(strwrap(string, width=nwrap), collapse="\n")
}
swr = Vectorize(swr)
industry_tf_idf %>%
  group_by(industry) %>%
  slice_max(tf_idf,n=10,with_ties = F) %>%
  ungroup() %>%
  mutate(industry = swr(industry), row = row_number(), word2 = fct_reorder(word, tf_idf)) -> n

n %>%
  ggplot(aes(tf_idf, word2, fill = industry)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~industry, ncol = 3, scales = "free") +
  labs(x = "tf-idf", y = "Top 10 Words")

  • Overall, from the plot above, the top words with the highest TF-IDF for each sub-industry characterize the sub-industry to some degree. – For example, Application Software contains subscriber and unlimited, which indicate a business model in that industry, while Semiconductors shows industry-specific words such as dilution and offshore.

3.5 Bi-grams

3.5.1 Tokenization

  • Let’s tokenize data using bigrams and remove stop words.
token_df_bi <- text_df3 %>% 
  unnest_tokens(bigram, documents_pos_tagged, token="ngrams",n=2) %>% 
  separate(bigram, c("word1","word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word1 %in% mystopwords$word) %>% 
  filter(!word2 %in% stop_words$word) %>% 
  filter(!word2 %in% mystopwords$word) %>% 
  unite(bigram, word1, word2, sep = " ")

# check the language again and remove non-English words
token_df_bi$lan <- cld3::detect_language(token_df_bi$bigram)
token_df_bi <- token_df_bi %>% filter(lan=="en") 

# remove unused features
token_df_bi %>% select(-text,-lan) -> token_df_bi

# save file
saveRDS(token_df_bi,file = "token_df_bi.rds")
  • Explore the most common bi-grams (two words) at a corpus level.
# top 20 words
token_df_bi %>% 
  count(bigram) %>% 
  mutate(bigram=reorder(bigram,n)) %>% 
  slice_max(n,n=20) %>% 
  ggplot(aes(bigram,n)) + 
  geom_bar(stat = "identity") +
  xlab(NULL) +
  coord_flip()

  • Words used in finance and accounting are observed, such as discussion analysis, report form, accounting, investment, and commercial.
  • We can also observe that these bigrams provide little insight. Hence, we need to remove common and frequent words from the corpus.

3.5.2 TF-IDF

3.5.2.1 Document level

# calculating TF-IDF scores
bi_words = token_df_bi %>% count(doc_id, bigram, sort=T)

total_words <- bi_words %>% 
  group_by(doc_id) %>% 
  dplyr::summarize(total = sum(n))
## `summarise()` ungrouping output (override with `.groups` argument)
# joining data
bi_words <- bi_words %>% 
  left_join(total_words)
## Joining, by = "doc_id"
bi_tf_idf <- bi_words %>% bind_tf_idf(bigram,doc_id,n) %>% select(-total) 
# let's explore the top 10 bi-grams with the highest tf-idf
bi_tf_idf %>% 
  group_by(bigram) %>% 
  arrange(desc(tf_idf)) %>% 
  top_n(10) 
## Selecting by tf_idf
## # A tibble: 42,312 x 6
## # Groups:   bigram [14,053]
##    doc_id bigram                    n     tf   idf tf_idf
##    <chr>  <chr>                 <int>  <dbl> <dbl>  <dbl>
##  1 doc119 material adverse          1 1       1.11  1.11 
##  2 doc120 material adverse          1 1       1.11  1.11 
##  3 doc82  public limited           12 0.096   4.27  0.410
##  4 doc43  december comparable      31 0.0891  3.58  0.319
##  5 doc105 software subscription    73 0.0996  2.97  0.296
##  6 doc103 holding annual           16 0.0669  3.98  0.267
##  7 doc20  stock indian             20 0.0604  4.27  0.258
##  8 doc19  stock indian             24 0.0578  4.27  0.247
##  9 doc98  science application      19 0.0648  3.76  0.244
## 10 doc102 holding annual           18 0.0596  3.98  0.237
## # ... with 42,302 more rows
  • A wide range of TF-IDF values is observed, from 0.24 to 1.11. – The top bi-grams are mostly nouns that provide little insight. – Therefore, we need to remove common and rare words, i.e., outliers.
outliers1.2 <- boxplot(bi_tf_idf$tf_idf, ylab = "TF-IDF")$out

# check statistical summary 
summary(bi_tf_idf$tf_idf)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000118 0.0078163 0.0115818 0.0136100 0.0164240 1.1079582
# drop the rows containing outliers
bi_tf_idf_2 <- bi_tf_idf %>% filter(tf_idf <= 0.2 & tf_idf >= 0.00002)


bi_tf_idf_2 %>% 
    group_by(bigram) %>% 
    arrange(desc(tf_idf)) %>% 
    top_n(10)
## Selecting by tf_idf
## # A tibble: 42,301 x 6
## # Groups:   bigram [14,050]
##    doc_id bigram                   n     tf   idf tf_idf
##    <chr>  <chr>                <int>  <dbl> <dbl>  <dbl>
##  1 doc99  science application     17 0.0514  3.76  0.193
##  2 doc176 month month             15 0.0758  2.54  0.192
##  3 doc70  information security    19 0.0688  2.73  0.188
##  4 doc177 month month             14 0.0722  2.54  0.183
##  5 doc89  impact special           8 0.0408  3.98  0.163
##  6 doc99  growth contraction      13 0.0393  3.98  0.156
##  7 doc137 waiver exclusivity      10 0.0365  4.27  0.156
##  8 doc70  cyber incident           8 0.0290  5.37  0.156
##  9 doc187 interactive solution    11 0.0330  4.68  0.155
## 10 doc100 growth contraction      15 0.0376  3.98  0.150
## # ... with 42,291 more rows
  • After removing outliers, the bi-grams are more informative, yielding industry-specific terms such as science application, information security, and cyber incident.

3.5.2.2 Industry level

  • Let’s create a tokenized data frame using the data from the previous part.
# joining data tables
token_df_bi2 = token_df_bi %>% semi_join(bi_tf_idf_2, by = c("doc_id","bigram"))

# save file
saveRDS(token_df_bi2,file = "token_df_bi2.rds")
# create a new column of sub industry
token_df_bi2 = token_df_bi2 %>%
  mutate(industry=case_when(
    cik %in% as.numeric(dp_outsource) ~ "Data Processing & Outsourced Services",
    cik %in% as.numeric(app_software) ~ "Application Software",
    cik %in% as.numeric(it_con) ~ "IT Consulting & Other Services",
    cik %in% as.numeric(semi_con) ~ "Semiconductors",
    cik %in% as.numeric(tech_hw) ~ "Technology Hardware, Storage & Peripherals"
  )) %>%
  mutate(industry=industry %>% as.factor())
  • Let’s compute bi-gram counts and TF-IDF scores at the industry level.
# word counts
industry_bi <- token_df_bi2 %>%
  count(industry, bigram, sort = TRUE)

# total word counts
total_industry_bi <- industry_bi %>%
  group_by(industry) %>%
  summarize(total = sum(n))

# left join
industry_bi <- industry_bi %>%
  left_join(total_industry_bi)
# TF-IDF calculation
industry_tf_idf_bi <- industry_bi %>%
  bind_tf_idf(bigram,industry,n) %>%
  select(-total)
  • Visualization
options(scipen = 100)

industry_tf_idf_bi %>%
  group_by(industry) %>%
  slice_max(tf_idf,n=10,with_ties = F) %>%
  ungroup() %>%
  mutate(industry = swr(industry), row = row_number(), word2 = fct_reorder(bigram, tf_idf)) -> n

n %>%
  ggplot(aes(tf_idf, word2, fill = industry)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~industry, ncol = 3, scales = "free") +
  labs(x = "tf-idf", y = "Top 10 Words")

  • Bi-grams provide more insight since the top words are more coherent. – Overall, words with distinctive TF-IDF values appear in App, Semi Con, and DP. – In the App industry, oracle system suggests that Oracle systems are widely used. Bi-grams also add context compared with unigrams, e.g., subscription becomes software subscription, and forfeiture precedes cancellation. – For IT Consult, the top words growth contraction and contraction percentage suggest that the industry may not have performed well over the years. Moreover, delivery model and global delivery likely reflect how the industry works, since companies act as agencies delivering expertise to clients. – For DP, the distinctive words are card network, information security, and investment security, which might indicate a security concern in the industry. – For Semi Con, the top words are industry-specific terms such as wafer foundry, game console, and waiver-related words, along with terms describing the industry such as manufacturing assembly and portion manufacturing. – Top words in Tech Hardware, such as channel mix, conversion cycle, drive industry, price protection, and geography channel, likely relate to sales and revenue.
# remove unused files
remove(postagged)
remove(token_df_all)
remove(token_df_bi)
remove(token_df2)
remove(bi_tf_idf)
remove(bi_words)

4 Sentiment analysis

This section illustrates how text-derived features, especially polarity, relate to the stock price difference. In addition, [@Mckay] found that quarterly conference call tone is a significant estimator of abnormal stock returns. Furthermore, [@Griffin] evidenced that investors respond to EDGAR 10-K filings within a short period (0-2 days) once the filings are available. However, we would like to examine a longer-term effect. Hence, we will use a time window of 7 days, i.e., a week, and inspect whether a similar effect holds.

The approach starts with retrieving the closing stock price for the pre-filing and post-filing dates. We then create a new feature, the stock price difference, defined as the difference in closing price between two dates: a) the pre-filing date, 7 days before the filing date (the date the 10-K form is submitted on EDGAR), and b) the post-filing date, 5 days after the filing date. Next, text-related variables are extracted: sentiment, polarity, formality, and readability. Lastly, to inspect the relationship with and the effect on the price difference, several regression models are implemented: fixed effect, random effect, and linear multiple regression models. The approach is summarized in the figure below.

Figure 5 Summary of sentiment analysis approach
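The price-difference construction above can be sketched as follows. This is a minimal sketch: `filings` and `prices` are hypothetical input tables, while the derived column names mirror those used later in the analysis.

```r
library(dplyr)

# filings: one row per 10-K (comp, filing_date)          -- hypothetical input
# prices:  daily closing prices (comp, date, close)      -- hypothetical input
price_diff <- filings %>%
  mutate(date_before = filing_date - 7,      # pre-filing date: 7 days before
         date_after  = filing_date + 5) %>%  # post-filing date: 5 days after
  left_join(prices, by = c("comp", "date_before" = "date")) %>%
  rename(sp_before = close) %>%
  left_join(prices, by = c("comp", "date_after" = "date")) %>%
  rename(sp_after = close) %>%
  mutate(sp_diff = sp_after - sp_before)     # the target variable
```

In practice the lookup dates would also need to be shifted to the nearest trading day, since markets are closed on weekends and holidays.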

4.2 Analysis

4.2.1 Dictionaries

  • Which dictionary should we use? – Let’s first explore the relationships among them. – Polarity is used as a proxy for tone in this context, since it is the net effect of positive and negative sentiment.
cor(text_df7[, c("SentimentLM", "SentimentHE", "SentimentQDAP", "SentimentGI")])
##               SentimentLM SentimentHE SentimentQDAP SentimentGI
## SentimentLM     1.0000000  0.11639660    0.36864027   0.4869805
## SentimentHE     0.1163966  1.00000000   -0.08281281  -0.2691861
## SentimentQDAP   0.3686403 -0.08281281    1.00000000   0.8110296
## SentimentGI     0.4869805 -0.26918607    0.81102960   1.0000000
  • The GI dictionary is highly correlated with the QDAP dictionary (0.81) and moderately correlated with the LM dictionary (0.49).
  • Let’s compare the polarity among dictionaries across years.
text_df7 %>% pivot_longer(cols = c("SentimentGI","SentimentLM","SentimentQDAP","SentimentHE"), names_to = "dict", values_to = "sentiment") %>%
  ggplot(aes(x = as.numeric(year), y = sentiment, group=dict, colour = dict)) + 
  geom_point(show.legend = F) + 
  geom_smooth() +
  facet_wrap(~dict, ncol = 1, scales = "free_y")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • The GI dictionary presents consistently higher polarity scores across time, with an average polarity of about 0.1, while the others are lower. – For example, the average polarity from the QDAP dictionary is about 0.05 across years, while the LM dictionary yields negative polarity.
  • Let’s check the polarity for each industry across periods.
text_df7 %>% pivot_longer(cols = c("SentimentGI","SentimentLM","SentimentQDAP","SentimentHE"), names_to = "dict", values_to = "sentiment") %>%
  ggplot(aes(x = as.numeric(year), y = sentiment, group=dict, colour = dict)) + 
  geom_point(show.legend = F) +
  geom_smooth() +
  facet_wrap(~industry, ncol = 2, scales = "free_y")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • Overall, GI dictionary provides the highest polarity, followed by QDAP, HE, and LM dictionaries across industries.
  • For each industry, all dictionaries provide the same polarity pattern across the periods, in general, except for the DP industry where the polarity value from the HE dictionary deviates from the others.
  • However, Naldi (2019) suggested using the polarity extracted with the sentimentr package, which relies on the lexicon developed by Jockers and Rinker. This lexicon provides higher accuracy than the dictionaries above, which handle negators less efficiently.
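A quick check of the negator handling (the example sentences are hypothetical):

```r
library(sentimentr)

# a valence shifter ("not") should flip the polarity of "good";
# a plain bag-of-words dictionary would score both sentences as positive
sentences <- c("The outlook is good.", "The outlook is not good.")
sentiment_by(sentences)$ave_sentiment
# the first score is positive and the second negative
```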
# create a new data frame retrieving polarity values
sent_df = tibble()

for (i in 1:nrow(text_df7)){
  print(i)
  t = text_df7$text[i] %>% get_sentences() %>% sentimentr::sentiment_by()
  sent_df = bind_rows(sent_df, t)
}

sent_df = sent_df %>% mutate(doc_id = paste0("doc",row_number()))
# joining data
text_df8 = text_df7 %>% left_join(sent_df) %>% select(-sd,-element_id)
## Joining, by = "doc_id"
  • Let’s inspect the correlation across dictionaries.
cor(text_df8[, c("SentimentLM", "SentimentHE", "SentimentQDAP", "SentimentGI", "ave_sentiment")])
##               SentimentLM SentimentHE SentimentQDAP SentimentGI ave_sentiment
## SentimentLM     1.0000000  0.11639660    0.36864027   0.4869805   -0.10061405
## SentimentHE     0.1163966  1.00000000   -0.08281281  -0.2691861    0.47655258
## SentimentQDAP   0.3686403 -0.08281281    1.00000000   0.8110296   -0.04365885
## SentimentGI     0.4869805 -0.26918607    0.81102960   1.0000000   -0.22270385
## ave_sentiment  -0.1006140  0.47655258   -0.04365885  -0.2227038    1.00000000
  • We see that the polarity score from the new lexicon is moderately correlated with the HE dictionary result (0.48), possibly because of word overlap between the two.
  • Let’s see a box plot comparing polarity across industries.
# box plot
text_df8 %>% ggplot(aes(x = industry, y = ave_sentiment, color = industry)) + labs(x="Industry", y="Polarity") +
  geom_boxplot() + scale_x_discrete(labels = NULL, breaks = NULL)

  • The plot reveals that the App industry has the highest average polarity, followed by DP, IT Consult, Semi Con, and Tech Hardware, respectively.

4.2.2 Relationship with polarity

  • Let’s see the relationship between polarity and years.
text_df8 %>%
  filter(!year == "2020") %>%
  ggplot(aes(x = as.numeric(year), y = ave_sentiment)) + 
  geom_point(show.legend = F) + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • From the figure above, there is potentially no relationship between them, which requires further inspection.
  • Let’s use the average polarity instead.
text_df8 %>% 
  group_by(year) %>% 
  summarize(sent_avg = mean(ave_sentiment)) %>%
  filter(!year == "2020") %>% ggplot(aes(x=as.numeric(year),y=sent_avg)) + geom_line() + xlab("Year") + ylab("Average Polarity")
## `summarise()` ungrouping output (override with `.groups` argument)

  • According to the figure above, we see that the polarity scores fluctuate across the periods.
  • Let’s dig deeper by looking at sub industry level.
text_df8 %>%
  filter(!year == "2020") %>%
  ggplot(aes(x = as.numeric(year), y = ave_sentiment, group=industry, colour=industry)) + 
  geom_point(show.legend = F) + 
  geom_smooth() +
  facet_wrap(~industry, ncol = 2, scales = "free_y") + 
  xlab("Year") + ylab("Average Polarity")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • Overall, DP and Tech Hardware have higher polarity across the periods, while Semi Con presents a downward trend, App an upward trend, and IT Consult a steady trend.
  • Let’s use the average polarity scores.
# average polarity by year and industry
text_df8 %>% 
  group_by(year,industry) %>% 
  summarize(sent_avg = mean(ave_sentiment)) %>%
  filter(!year == "2020") %>% ggplot(aes(x=as.numeric(year),y=sent_avg, group=industry, colour = industry)) + geom_line() + xlab("Year") + ylab("Average Polarity")
## `summarise()` regrouping output by 'year' (override with `.groups` argument)

  • The same patterns are observed regarding the average polarity scores.

4.2.3 Correlation

  • Let’s check correlation of numeric variables.
# new data frame
text_df9 = text_df8 %>% select(-text,-documents_pos_tagged, -WordCount,-word.count,-cik,-NegativityGI,-NegativityHE,-NegativityLM,-NegativityQDAP,-PositivityGI,-PositivityHE,-PositivityLM,-PositivityQDAP,-RatioUncertaintyLM)

# filter only numerical features
text_df9 %>% select_if(is.numeric) -> df_numeric
correlate(df_numeric) -> cordata
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
# explore variables which have higher correlation value than |0.1|
cordata %>% select(term, sp_diff) %>% filter(sp_diff > 0.1 | sp_diff < -0.1) %>% arrange(desc(sp_diff)) 
## # A tibble: 0 x 2
## # ... with 2 variables: term <chr>, sp_diff <dbl>
  • From the correlation, no feature shows even a weak relationship (|r| > 0.1) with the stock price difference.
  • Let’s split the data into 2 data sets - numerical and categorical data sets.
# non-numerical data set
df_etc = text_df9 %>% select_if(negate(is.numeric)) %>% select(-comp_name,-form_type,-access_no,-doc_id,-filing_date,-date_before,-date_after,-date_after_wd) %>% mutate(comp = as.factor(comp), year = year(as.Date(as.character(year), format = "%Y")))

# numerical data set
df_numeric2 = df_numeric %>% select(-sentence.count,-sp_before,-sp_after)

4.2.4 Pre-processing data

  • Now, let’s normalize the data for use in the regression models.
preproc2 <- preProcess(df_numeric2, method=c("range"))
norm2 <- predict(preproc2, df_numeric2)
  • Let’s combine the data.
df_reg = cbind(norm2, df_etc)
  • We have three main variables extracted from comments - formality, reading grade level, and reading ease level.
  • Let’s see the relationships to the stock price difference, starting with the formality.
# explore the distribution
hist(df_reg$formality)

hist(df_reg$sp_diff)

  • The formality scores center around 0.3, while the normalized stock price difference centers around 0.4-0.5.
# relationship 
df_reg %>% select(formality,sp_diff) %>% na.omit() %>% ggplot(aes(x=formality,y= sp_diff))+geom_smooth(method="loess",se = F)+ geom_point() + labs(x="Formality", y="Price difference", subtitle="Price Difference and Formality Relationship")
## `geom_smooth()` using formula 'y ~ x'

  • From the result above, there is no clear relationship.
  • Now, let’s explore reading grade level.
hist(df_reg$FK_grd.lvl)

# check the statistical summary of FK_grd.lvl
df_reg$FK_grd.lvl %>% summary(.)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2717  0.3437  0.3557  0.4076  1.0000
# plot reading grade level against the price diff
df_reg %>% select(FK_grd.lvl, sp_diff) %>% na.omit() %>% ggplot(aes(x=FK_grd.lvl,y= sp_diff))+geom_smooth(method="loess",se = F)+geom_point() + labs(x="Reading grade level", y="Price difference", subtitle="Price Difference and Reading Grade Level Relationship")

  • From the graph, no noticeable relationship is observed between the price difference and the reading grade level.
  • Now, let’s explore the reading ease level.
# price diff vs FK reading ease
df_reg %>% select(FK_read.ease,sp_diff) %>% na.omit() %>% ggplot(aes(x=FK_read.ease,y= sp_diff))+geom_smooth(method="loess",se = F)+geom_point() + labs(x="Reading ease", y="Price difference", subtitle="Price Difference and Reading Ease Relationship")

  • No noticeable relationship is observed between the two features.

4.2.5 Relationship with stock price difference

# price diff vs polarity
df_reg %>% filter(!year == "2020") %>% ggplot(aes(x=ave_sentiment,y=sp_diff)) + geom_point() + stat_smooth() + labs(x="Polarity", y = "Price Difference")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • It is possible that polarity is not related to the stock price difference.
df_reg %>% filter(!year == "2020") %>% mutate(year = as.numeric(as.character(year))) %>% ggplot(aes(x=year,y=sp_diff)) + geom_point() + stat_smooth() + labs(x="Year", y = "Price Difference") 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • The price difference appears to be independent of the period.
df_reg %>% filter(!year == "2020") %>% mutate(year = as.numeric(as.character(year))) %>% ggplot(aes(x=industry,y=sp_diff,color = industry)) + geom_boxplot() + labs(x="Industry", y = "Price Difference") + scale_x_discrete(labels = NULL, breaks = NULL)

  • High variation in price difference is observed for the App and IT Consult industries, whereas the average price difference is similar across industries.

4.2.6 Regression Analysis

  • Let’s remove the year 2020, since it contains only a single observation.
  • We will create 2 data frames, panel and independent data, and compare their performance. – This is because the observations are potentially dependent over time, which would affect how well we can predict the stock price difference.
df_reg %>% count(year, sort=T)
df_reg = df_reg %>% filter(!year == '2020')

# original data converting ‘year’ as factor
df_reg = df_reg %>% mutate(year = as.factor(year))

# create panel data frame
df_reg2 = pdata.frame(df_reg, index = c("comp", "year"))
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## The following object is masked from 'package:NCmisc':
## 
##     Dim
## This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
## 
## Attaching package: 'mgcv'
## The following object is masked from 'package:stm':
## 
##     s
# Non-linear model
model0 <- gam(sp_diff ~ s(ave_sentiment) + industry + year + formality + FK_read.ease, data = df_reg)

# OLS model
model01 <- lm(sp_diff ~ ave_sentiment + industry + year + formality + FK_read.ease, data = df_reg)

# fixed effect model
model1 <- plm(sp_diff ~ ave_sentiment + industry + formality + FK_read.ease, data = df_reg2, 
                    index = c("comp", "year"), 
                    effect = "twoways", model = "within")

# random effect model
model2 <- plm(sp_diff ~ ave_sentiment + industry + formality + FK_read.ease, data = df_reg2, 
                    index = c("comp", "year"), 
                    effect = "twoways", model = "random")

# pooled model
model3 <- plm(sp_diff ~ ave_sentiment + industry + formality + FK_read.ease, data = df_reg2, 
                    index = c("comp", "year"), 
                    effect = "twoways", model = "pooling")
  • Let’s check the model specification using several tests, following Torres-Reyna’s approach (Panel Data using R, princeton.edu).
# test fixed-effect vs random effect models
phtest(model1,model2) # p-value > 0.05, use random effect
## 
##  Hausman Test
## 
## data:  sp_diff ~ ave_sentiment + industry + formality + FK_read.ease
## chisq = 0.0047838, df = 3, p-value = 0.9999
## alternative hypothesis: one model is inconsistent
# Breusch-Pagan Lagrange multiplier test
plmtest(model3, type=c("bp")) # p-value >0.05, use OLS
## 
##  Lagrange Multiplier Test - (Breusch-Pagan) for unbalanced panels
## 
## data:  sp_diff ~ ave_sentiment + industry + formality + FK_read.ease
## chisq = 3.2726, df = 1, p-value = 0.07044
## alternative hypothesis: significant effects
  • The tests above suggest that the panel (time) effects are not significant and the observations can be treated as independent. Therefore, we will use linear multiple regression models to estimate the effect on the stock price difference.
model4 <- lm(sp_diff ~ ave_sentiment + industry + year + formality, data = df_reg)

model5 <- lm(sp_diff ~ ave_sentiment + industry + year, data = df_reg)

model6 <- lm(sp_diff ~ ave_sentiment + industry, data = df_reg)

stargazer::stargazer(model01,model4,model5,model6,type = "text", single.row = TRUE, # to put coefficients and standard errors on same line
          no.space = TRUE, # to remove the spaces after each line of coefficients
          omit.stat=c("f", "ser"),
          digits = 2,
          column.sep.width = "0pt",
          font.size = "tiny", # to make font size smaller
          out = "modelresult0.html",
          covariate.labels=c("sentiment","DP","IT Consult","Semi Con","Tech Hardware",
                             "2011","2012","2013","2014","2015","2016","2017","2018","2019"," formality","readingeaselevel","Intercept"))

Figure 6 Summary of regression models

  • For model selection, we check the adjusted R squared. – The score, 0.07, is similar for model01, model4, and model5. – However, we saw in the previous part that reading ease level and formality have no relation to the stock price difference. Therefore, model 5 is preferred for simplicity.
  • From the result, the predictors explain only 7% of the variation in the stock price difference.
  • However, no predictor has a significant effect on the price difference. – This indicates that other variables, beyond polarity, year, and industry, could better explain the price difference. – Another possible reason is the small sample size, especially within each industry, which requires additional exploration. – Nevertheless, the coefficient of polarity is positive, suggesting a positive relation to the stock price difference.

5 Topic Modelling

This section shows how to construct the topic modelling pipeline. It starts with data preparation and corpus construction. Next, the optimal number of topics is searched for, using semantic coherence and exclusivity as selection metrics. The structural topic model (stm) is then fitted with the optimal number of topics, and the resulting analysis presents the topic-document and topic-word associations. In addition, marginal effects are estimated to see the relationship between topics and related features (covariates) such as the price difference, polarity, and year. Lastly, multiple linear regression models are implemented to assess the effect of topics on the stock price difference. The approach is summarized in the figure below.
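The marginal-effect step can be sketched with `stm::estimateEffect` (assuming the 8-topic model `finfit1` and the metadata `result1$meta` produced later in this section):

```r
library(stm)

# regress topic proportions on the document-level covariates
eff <- estimateEffect(1:8 ~ sp_avg + sent_avg + year + factor(industry),
                      stmobj = finfit1,
                      metadata = result1$meta,
                      uncertainty = "Global")

# e.g., how the prevalence of topic 1 moves with the average price difference
plot(eff, covariate = "sp_avg", method = "continuous", topics = 1)
```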

Figure 7 Summary of topic modelling analysis approach

Data preparation

postagged = readRDS("postagged.rds")

lematized <- postagged %>% filter(upos %in% c("NOUN",
                                              "ADJ",
                                              "ADV"))

saveRDS(lematized, file = "lematized.rds")
# read file
lematized = readRDS("lematized.rds")

# explore top words across corpus
lematized %>% group_by(lemma) %>% 
  summarise(total =n()) %>% 
  arrange(desc(total)) %>% 
  top_n(30)
## `summarise()` ungrouping output (override with `.groups` argument)
## Selecting by total
## # A tibble: 30 x 2
##    lemma   total
##    <chr>   <int>
##  1 revenue 20871
##  2 fiscal  16650
##  3 tax     16181
##  4 year    15614
##  5 income  12510
##  6 cash    12052
##  7 expense 11482
##  8 net     11348
##  9 other   10732
## 10 service  9629
## # ... with 20 more rows
# create new data frame
lematized_2 = lematized %>% count(doc_id, lemma, sort=T)

lematized_2_total <- lematized_2 %>% 
  group_by(doc_id) %>% 
  summarize(total = sum(n))
## `summarise()` ungrouping output (override with `.groups` argument)
lematized_2 <- lematized_2 %>% 
  left_join(lematized_2_total)
## Joining, by = "doc_id"
# save file
saveRDS(lematized_2, file = "lematized_2.rds")
# TF-IDF calculation
lem_tf_idf <- lematized_2 %>% bind_tf_idf(lemma,doc_id,n) %>% select(-total) 
# plot a histogram
hist(lem_tf_idf$tf_idf,breaks = 50,main="TF-IDF plot", xlim = c(0,1))

# statistical summary
summary(lem_tf_idf$tf_idf) # outliers are observed: the median is only 0.00027 while the maximum is 0.42
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0001357 0.0002704 0.0005544 0.0005293 0.4229301
# let's filter the data
lem_tf_idf_2 <- lem_tf_idf %>% 
  filter(tf_idf<=0.005 & tf_idf>=0.0001)

hist(lem_tf_idf_2$tf_idf,breaks = 20,main="TF-IDF plot", xlim = c(0,0.005))

# let's explore the top 30 words
lem_tf_idf_2 %>% 
  group_by(lemma) %>% 
  arrange(desc(tf_idf)) %>% 
  top_n(30)
## Selecting by tf_idf
## # A tibble: 76,738 x 6
## # Groups:   lemma [7,042]
##    doc_id lemma         n      tf   idf  tf_idf
##    <chr>  <chr>     <int>   <dbl> <dbl>   <dbl>
##  1 doc108 big          22 0.00178 2.81  0.00500
##  2 doc123 escrow       33 0.00236 2.11  0.00499
##  3 doc104 fiscal      282 0.0257  0.194 0.00499
##  4 doc199 wafer        19 0.00342 1.46  0.00499
##  5 doc203 atmel         7 0.00125 3.98  0.00499
##  6 doc203 issc          7 0.00125 3.98  0.00499
##  7 doc114 hedge        23 0.00380 1.31  0.00498
##  8 doc86  card         21 0.00504 0.989 0.00498
##  9 doc162 ancillary     5 0.00132 3.76  0.00497
## 10 doc201 assembly     19 0.00341 1.46  0.00497
## # ... with 76,728 more rows
mystopwords <- tibble(word =c("fiscal", "other", "related", "business", "rate", "product", "note", "payment", "table", "contract", "foreign", "currency", "such", "charge", "time", "purchase", "end","research","support", "policy", "available","party", "lease", "requirement", "recognition", "most","significant", "proceed","solutions","excess","issuance","connection","government","day","act","strategy","state","difference","march","country","experience","consulting","action","page","NA", "revenue", "income","tax","cash","cost","net","financial","net","asset","expense","result","total","share","service","company","increase","management","sale"))

allsw = stop_words %>% full_join(mystopwords)
## Joining, by = "word"
lemma = lem_tf_idf_2 %>% select(doc_id,lemma) %>% anti_join(allsw, by = c("lemma" = "word"))

# construct a sentence from lemmatized 
lemma_2 <- lemma %>% group_by(doc_id) %>% summarise(text_tagged = paste(lemma,collapse = " "))
## `summarise()` ungrouping output (override with `.groups` argument)
# create a data frame for the coming analysis
text_df10 <- text_df8 %>% left_join(lemma_2)
## Joining, by = "doc_id"
# inspect NA values
na_rows = subset(text_df10, is.na(text_df10$text_tagged)) %>% select(comp_name,comp,filing_date,text_tagged)

# remove unused data
remove(postagged)
## Warning in remove(postagged): object 'postagged' not found
remove(lematized)
remove(lematized_2)
remove(lem_tf_idf)
remove(lem_tf_idf_2)

5.1 Corpus construction

In this part, we generate 4 new variables, since the corpus and the coming analysis will be at the industry-year level: the aggregated text column (tp), the average stock price difference (sp_avg), the average formality score (form_avg), and the average polarity (sent_avg).

toprocess = text_df10 %>% 
  select(industry, year,text_tagged,sp_diff,ave_sentiment,formality) %>%
  group_by(industry,year) %>%
  summarize(tp = paste(text_tagged, collapse = " "),
            sp_avg = mean(sp_diff),
            form_avg = mean(formality),
            sent_avg = mean(ave_sentiment)) %>%
  filter(!year == "2020") %>%
  mutate(doc_id = paste("doc",row_number()),
         year = as.numeric(year))
## `summarise()` regrouping output by 'industry' (override with `.groups` argument)
# prepare the corpus used for stm estimation
processed <- textProcessor(toprocess$tp,
                           metadata = toprocess,
                           customstopwords = allsw$word,
                           stem = F)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Remove Custom Stopwords...
## Removing numbers... 
## Creating Output...
  • Let’s explore different thresholds for the words and documents that we can remove from the corpus.
plotRemoved(processed$documents, lower.thresh=seq(1,200, by=100))

# keep words that appear in more than 1% of the documents
threshold1 <- round(1/100 * length(processed$documents),0)

result1 <- prepDocuments(processed$documents,
                     processed$vocab,
                     processed$meta,
                     lower.thresh = threshold1)

# keep words that appear in more than 5% of the documents
threshold5 <- round(5/100 * length(processed$documents),0)

result5 <- prepDocuments(processed$documents,
                     processed$vocab,
                     processed$meta,
                     lower.thresh = threshold5)
## Removing 2298 of 6501 terms (3036 of 72241 tokens) due to frequency 
## Your corpus now has 50 documents, 4203 terms and 69205 tokens.

5.2 Number of topics

  • Based on the work from [@Lee], we can set K=0 to let the algorithm initialize itself, which yields the maximum plausible number of topics.
numtopics1 <- stm::searchK(result1$documents,result1$vocab,K=0, prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result1$meta,heldout.seed = 1) #100

numtopics5 <- stm::searchK(result5$documents,result5$vocab,K=0, prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result5$meta,heldout.seed = 1) #67
  • For the numtopics1, the maximum number of topics is 100.
  • For the numtopics5, the maximum number of topics is 67.
  • From the above, let’s explore different Ks
numtopics1 <- stm::searchK(result1$documents,result1$vocab,K=c(10,25,40,55,70,85,100), prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result1$meta,heldout.seed = 1) 

numtopics5 <- stm::searchK(result5$documents,result5$vocab,K=c(10,20,30,40,50,60,70), prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result5$meta,heldout.seed = 1) 
plot(numtopics1) # k = 7-12

  • From the figure above, the held-out likelihood diminishes slightly from 10 to 25 topics, after which it declines rapidly. In addition, semantic coherence shows a downward trend across all candidate numbers of topics. – This suggests that the optimal number of topics is less than 10.
plot(numtopics5) # k = 7-12

  • From the plot above, both the held-out likelihood and semantic coherence present a downward trend. – This suggests that the optimal number of topics is less than 10.
  • Let’s search for the optimal number of topics again.
numtopics1.1 <- stm::searchK(result1$documents,result1$vocab,K=seq(from=5, to=10, by=1), prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result1$meta,heldout.seed = 1)

numtopics5.1 <- stm::searchK(result5$documents,result5$vocab,K=seq(from=5, to=10, by=1), prevalence = ~sp_avg + sent_avg + year + factor(industry), data=result5$meta,heldout.seed = 1)
#par(mar=c(1,1,1,1))
plot(numtopics1.1) # k = 8

knitr::kable(numtopics1.1$results)
K exclus semcoh heldout residual bound lbound em.its
5 8.421397 -7.830872 -7.639919 26.84649 -884444.4 -884439.6 273
6 8.580645 -6.902828 -7.644201 1.307503 -881438.6 -881432 100
7 8.521526 -4.640247 -7.591978 3.018698 -874803.8 -874795.3 63
8 8.90089 -5.456275 -7.646804 2.106024 -873097 -873086.4 139
9 8.793623 -6.06389 -7.88957 2.6335 -872954.5 -872941.7 85
10 8.88515 -5.893097 -8.303825 3.166167 -871926.5 -871911.4 57
  • According to the table and plot above, the optimal number of topics is 8: it has the second-highest semantic coherence and a lower residual than K = 7.
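The selection rule just described can be made explicit in base R, using the values copied from the searchK table above. The tie-break heuristic (best two by coherence, then smaller residual) is our own summary of the reasoning, not an stm function:

```r
# searchK results for the 1% threshold corpus (values copied from the table)
k_results <- data.frame(
  K        = 5:10,
  semcoh   = c(-7.830872, -6.902828, -4.640247, -5.456275, -6.063890, -5.893097),
  residual = c(26.846490, 1.307503, 3.018698, 2.106024, 2.633500, 3.166167)
)

# Heuristic: among the two K values with the best semantic coherence,
# prefer the one with the smaller residual
top2   <- k_results[order(-k_results$semcoh), ][1:2, ]
best_k <- top2$K[which.min(top2$residual)]
best_k  # 8
```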
plot(numtopics5.1) # k = 5

knitr::kable(numtopics5.1$results)
  K   exclus     semcoh   heldout  residual      bound     lbound  em.its
  5 8.093306  -4.369500 -7.607057 0.9906994 -852751.0  -852746.2     191
  6 8.339352  -4.942364 -7.602807 1.0283100 -851002.6  -850996.0     104
  7 9.086524 -12.821520 -7.545167 2.0923630 -845654.4  -845645.9     113
  8 8.949104  -7.818806 -7.682021 1.3662090 -844172.8  -844162.2     171
  9 9.010704  -6.677006 -8.021299 1.4567200 -841532.6  -841519.8     183
 10 8.888684  -5.403103 -8.242890 2.5443880 -839108.7  -839093.6     154
  • From the figures above, the optimal number of topics is 5: it has the third-highest held-out likelihood and the highest semantic coherence compared with K = 6, 7, and 8.

5.3 Model estimation

  • We fit the model with the prevalence of all topics regressed on the predictors: the average price difference, average polarity, year, and industry.
# fitting stm model using the number of topics (K) = 8 with threshold = 1%
suppressWarnings(library(stm))
finfit1 <- stm::stm(documents = result1$documents,
               vocab = result1$vocab,
               K = 8,
               prevalence = ~sp_avg + sent_avg + year + factor(industry),
               #max.em.its = 75, 
               data = result1$meta,
               reportevery=10,
               gamma.prior = "L1",
               sigma.prior = 0.9,
               init.type = "Spectral",seed = 123)
suppressWarnings(library(stm))
finfit5 <- stm::stm(documents = result5$documents,
               vocab = result5$vocab,
               K = 5,
               prevalence = ~sp_avg + sent_avg + year + factor(industry),
               #max.em.its = 100, 
               data = result5$meta,
               reportevery=10,
               gamma.prior = "L1",
               sigma.prior = 0.9,
               init.type = "Spectral",seed = 123)
  • Let’s compare the two final models using semantic coherence and exclusivity as evaluation metrics.
# suppressMessages() hides the package-masking notes that
# suppressWarnings() alone does not catch
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
exsem_m1 <- as.data.frame(cbind(c(1:8),exclusivity(finfit1), semanticCoherence(model=finfit1, result1$documents), "Model1"))
exsem_m5 <- as.data.frame(cbind(c(1:5),exclusivity(finfit5), semanticCoherence(model=finfit5, result5$documents), "Model5"))

exsem_comparison <- rbind(exsem_m1, exsem_m5)
colnames(exsem_comparison) <- c("K","Exclusivity", "SemanticCoherence", "Model")

exsem_comparison$Exclusivity<-as.numeric(as.character(exsem_comparison$Exclusivity))
exsem_comparison$SemanticCoherence<-as.numeric(as.character(exsem_comparison$SemanticCoherence))

options(repr.plot.width=7, repr.plot.height=7, repr.plot.res=100)

plotexsem <-ggplot(exsem_comparison, aes(SemanticCoherence, Exclusivity, color = Model))+geom_point(size = 2, alpha = 0.7) + 
geom_text(aes(label=K), nudge_x=.05, nudge_y=.05)+
  labs(x = "Semantic coherence",
       y = "Exclusivity",
       title = "Exclusivity VS. Semantic Coherence")

plotexsem

  • The higher the semantic coherence, the more interpretable each topic, since its top words tend to co-occur in documents.
  • Exclusivity measures how unique a topic’s top words are to that topic.
  • We should therefore prefer the model lying toward the upper right of the plot above. Hence, the final model is model1, which contains 8 topics.
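The “upper right” rule can be made concrete by summarizing each model with its mean exclusivity and mean semantic coherence, then picking the model with the best combined rank. The per-topic scores below are illustrative stand-ins, not the fitted values:

```r
# Illustrative per-topic scores (made up for this sketch)
ex_m1 <- c(9.1, 9.4, 8.9, 9.2, 9.0, 9.3, 8.8, 9.1)
sc_m1 <- c(-1.2, -7.3, -4.2, -5.6, -4.2, -9.2, -2.0, -1.4)
ex_m5 <- c(8.5, 8.9, 9.5, 9.2, 9.3)
sc_m5 <- c(-4.4, -4.9, -12.8, -7.8, -6.7)

summ <- data.frame(
  model  = c("Model1", "Model5"),
  exclus = c(mean(ex_m1), mean(ex_m5)),
  semcoh = c(mean(sc_m1), mean(sc_m5))
)

# Lower rank-sum = closer to the upper-right corner of the plot
pick <- summ$model[which.min(rank(-summ$exclus) + rank(-summ$semcoh))]
pick  # "Model1"
```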

5.4 Analysis

5.4.1 Topic-Document Association

  • Let’s explore the document-topic association using a summary view of the data frame.
topic1_9 <- make.dt(finfit1, result1$meta)
View(topic1_9[,-c("tp")])
  • Let’s visualize the proportion of topic for each document.
suppressWarnings(library(tidytext)) # prevent warning error
theta <- tidytext::tidy(finfit1, matrix = "theta")

theta_df1 <- theta[theta$document%in%c(1:17),] # select the first 17 documents.

thetaplot1 <- ggplot(theta_df1, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
  geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ document, ncol = 3) +
  labs(title = "Theta values per document",
       y = expression(theta), x = "Topic")

# plot
thetaplot1

  • We can see that topic 8 has a high proportion in documents 1-10, which represent the App industry, while for documents 11-20 the dominant topic is 7, in combination with topic 3.
theta_df2 <- theta[theta$document%in%c(18:34),] # select the next 17 documents.

thetaplot2 <- ggplot(theta_df2, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
  geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ document, ncol = 3) +
  labs(title = "Theta values per document",
       y = expression(theta), x = "Topic")

# plot
thetaplot2

  • For documents 21-30, which belong to the IT Consult industry, the dominant topic is 5, whereas documents 31-40 show similar proportions for topics 1, 4, and 6.
theta_df3 <- theta[theta$document%in%c(35:51),] # select the next 17 documents.

thetaplot3 <- ggplot(theta_df3, aes(y=gamma, x=as.factor(topic), fill = as.factor(topic))) +
  geom_bar(stat="identity",alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ document, ncol = 3) +
  labs(title = "Theta values per document",
       y = expression(theta), x = "Topic")

# plot
thetaplot3

  • For documents 41-50, which belong to the Tech Hardware industry, the dominant topics are 1 and 2.
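The dominant-topic reading used above can be sketched on a toy theta matrix (synthetic numbers, not the fitted model): each row of theta is one document’s topic-proportion vector, and the dominant topic is simply the column with the largest value.

```r
# Toy theta: 3 documents x 3 topics, rows sum to 1
theta_toy <- matrix(c(0.70, 0.20, 0.10,
                      0.15, 0.60, 0.25,
                      0.05, 0.15, 0.80),
                    nrow = 3, byrow = TRUE)
stopifnot(isTRUE(all.equal(rowSums(theta_toy), rep(1, 3))))  # proportions

# Dominant topic per document
dominant <- apply(theta_toy, 1, which.max)
dominant  # 1 2 3
```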

5.4.2 Topic Correlation

  • Topics 1, 2, 4, and 6 are correlated and are related to the Tech Hardware and Semi Con industries, suggesting these industries cover similar topics in their MD&A sections.
  • Topics 3 and 7 are related; both are highly associated with the DP industry.
  • Topic 5 is mostly related to the IT Consult industry.
  • Topic 8 is highly relevant to the App industry.
mod.corr <- topicCorr(finfit1)
plot(mod.corr)
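A rough sketch of the idea behind topicCorr(): topics whose proportions co-occur across documents get linked. The correlation threshold below is a simplified stand-in for stm’s procedure, applied to synthetic proportions:

```r
# Synthetic document-topic proportions where topics 1 and 2 move together
set.seed(1)
t1 <- runif(20, 0.2, 0.4)
t2 <- t1 + rnorm(20, 0, 0.02)   # correlated with t1
t3 <- runif(20, 0.2, 0.4)       # independent of t1

# Normalize rows so each document's proportions sum to 1
theta_toy <- cbind(t1, t2, t3) / rowSums(cbind(t1, t2, t3))

# Link topics whose proportions are strongly correlated
# (simple cutoff standing in for stm's default)
adj <- cor(theta_toy) > 0.5
adj["t1", "t2"]  # TRUE: topics 1 and 2 would be connected in the graph
```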

5.4.3 Topic-Word Association

  • Topic coherence vs exclusivity
topicQuality(finfit1,documents=result1$documents)
## [1] -1.235813 -7.276399 -4.163224 -5.590105 -4.152409 -9.224564 -1.967508
## [8] -1.369184
## [1] 8.957245 9.798218 8.528626 8.960868 9.091806 9.283643 7.875148 8.383260

  • Topic Proportion
# topic distribution and top 5 word association
plot.STM(finfit1, "summary", n=5) 

  • We see that topic 5 has the highest proportion in the corpus, followed by topics 8, 7, 1, and 2.
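The proportion ranking above is essentially the column means of theta (each topic’s average share across documents); a toy sketch with a random matrix standing in for the fitted theta:

```r
set.seed(42)
# 8 documents x 5 topics, rows normalized to proportions
theta_toy <- prop.table(matrix(runif(40), nrow = 8), margin = 1)

# Corpus-level topic proportions
prop <- colMeans(theta_toy)
stopifnot(isTRUE(all.equal(sum(prop), 1)))  # shares sum to 1

order(prop, decreasing = TRUE)  # topics ranked by corpus share
```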
labelTopics(finfit1, topics=c(5,8,7,1,2), n=10)
## Topic 5 Top Words:
##       Highest Prob: information, partially, reportable, spending, estimate, measure, termination, environment, repurchase, cancellation 
##       FREX: information, achievement, staff, engagement, belief, billing, contractually, efficiency, effectiveness, team 
##       Lift: abacus, abr, adaptable, adomain, adversary, aeronautic, affair, afghanistan, agnostic, airborne 
##       Score: information, predict, skill, quota, bid, unbilled, absent, sciences, involuntary, booking 
## Topic 8 Top Words:
##       Highest Prob: technical, marketing, contents, customer, equivalents, stock, term, future, period, change 
##       FREX: complementary, training, ratably, upgrade, accuracy, vsoe, functionality, trademark, human, combination 
##       Lift: advertiser, ansy, archive, attract, basel, behance, beta, beverage, boston, browser 
##       Score: physically, unlimited, edition, detailead, budgetary, subscriber, publish, invoice, entitle, landlord 
## Topic 7 Top Words:
##       Highest Prob: operating, amount, growth, capital, fee, period, change, statement, institution, estimate 
##       FREX: fraud, loyalty, brazil, retail, migration, positively, regional, processing, house, institution 
##       Lift: absorb, alter, ancillary, apparel, brande, cbsm, checkwriter, custody, dishonored, drop 
##       Score: loyalty, delinquent, valuable, fraud, sponsorship, signature, pool, advisory, transactional, servicer 
## Topic 1 Top Words:
##       Highest Prob: repatriation, dividend, payable, exchange, debt, activity, adjustment, allowance, impairment, program 
##       FREX: analysis, taxation, potentially, repatriation, oem, rata, vigorously, unanticipate, eventual, explanation 
##       Lift: accompany, acs, actuary, adjacency, adjusted, advantaged, adviser, affiliates, airplane, airpod 
##       Score: analysis, oem, products, inventories, accompany, cents, connectkey, consumable, cooperation, dfe 
## Topic 2 Top Words:
##       Highest Prob: limited, ordinary, indefinitely, worldwide, respect, appeal, supplier, reference, determinable, headcount 
##       FREX: supplier, appeal, limited, ordinary, word, indefinitely, repair, worldwide, determinable, irs 
##       Lift: aftershock, alloy, amk, balanced, broadening, cessation, clearance, committe, discussion, disks 
##       Score: discussion, inventories, desktop, audio, inch, thailand, factory, handset, flash, intensive
  • Let’s plot the top 10 FREX words for topics 5, 8, 7, 1, and 2.
plot.STM(finfit1, "labels", topics=c(5,8,7,1,2), label="frex", n=10, width=60)

  • Top words per topic
beta <- tidytext::tidy(finfit1) 

options(repr.plot.width=7, repr.plot.height=8, repr.plot.res=100) 

beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities by topic")

  • Retrieve the documents most highly associated with each of these topics.
thoughts5 <- findThoughts(finfit1, texts = toprocess$tp, n = 2, topics = 5)$docs[[1]]
thoughts8 <- findThoughts(finfit1, texts = toprocess$tp, n = 2, topics = 8)$docs[[1]]
thoughts7 <- findThoughts(finfit1, texts = toprocess$tp, n = 2, topics = 7)$docs[[1]]
thoughts1 <- findThoughts(finfit1, texts = toprocess$tp, n = 2, topics = 1)$docs[[1]]
thoughts2 <- findThoughts(finfit1, texts = toprocess$tp, n = 2, topics = 2)$docs[[1]]

plotQuote(thoughts5, width = 100, maxwidth = 600,text.cex=1,main = "Topic5")

plotQuote(thoughts8, width = 100, maxwidth = 600,text.cex=1,main = "Topic 8")

plotQuote(thoughts7, width = 100, maxwidth = 600,text.cex=1,main = "Topic 7")

plotQuote(thoughts1, width = 100, maxwidth = 600,text.cex=1,main = "Topic 1")

plotQuote(thoughts2, width = 100, maxwidth = 600,text.cex=1,main = "Topic 2")

  • Word contrasts between topics
plot.STM(finfit1, type = "perspectives", topics=c(8,7), label="frex", n=10, width=100)

plot.STM(finfit1, type = "perspectives", topics=c(8,5), label="frex", n=10, width=60)

plot.STM(finfit1, type = "perspectives", topics=c(7,5), label="frex", n=16, width=60)

  • We see considerable overlap among the top words. The distinctive word for topic 5 is “information”, while for topic 8 it is “technical”.

  • From the above, we can assign the following topic labels:
      – Topic 1: Consumer and market analysis in Semiconductors
      – Topic 2: Revenue concern in Tech Hardware
      – Topic 3: Operation management in Data Processing
      – Topic 4: Operation in Semiconductors
      – Topic 5: Information revenue and spending in IT Consulting
      – Topic 6: Report discussion and stakeholder concern in Semiconductors
      – Topic 7: Growth and competitive analysis in Data Processing
      – Topic 8: Competitive skills and winning strategy in Application Software
topic_labels <- c("T1",
                  "T2",
                  "T3",
                  "T4",
                  "T5",
                  "T6",
                  "T7",
                  "T8")

5.4.4 Topic and Effect Estimation

  • Effect of the average price difference (sp_avg) on topic prevalence
effect1 = stm::estimateEffect(~ sp_avg + sent_avg + year + factor(industry), stmobj = finfit1, metadata = result1$meta)

effect1_sum = summary(effect1)
# effect1_sum$tables
plot(effect1, covariate = "sp_avg",
     topics = c(1:8),
     model = finfit1, method = "difference",
     cov.value1 = "+10", cov.value2 = "-10",
     xlab = "Low Difference … High Difference",
     xlim = c(-0.1,0.1),
     main = "Marginal Effects of Price Difference",
     #custom.labels =topic_labels,
     labeltype = "custom")

  • Topic 8 (Competitive skills and winning strategy in Application Software) appears to have the largest effect on the price difference, followed by topics 2, 1, 6, and 7.
plot(effect1, covariate = "sent_avg",
     topics = c(1:8),
     model = finfit1, method = "difference",
     cov.value1 = "+5", cov.value2 = "-5",
     xlab = "Low Polarity … High Polarity",
     xlim = c(-0.3,0.3),
     main = "Marginal Effects of Polarity",
     #custom.labels =topic_labels,
     labeltype = "custom")

  • Topic 7 (Growth and competitive analysis in Data Processing) shows the largest marginal effect of polarity, followed by topic 8 (Competitive skills and winning strategy in Application Software).
  • Let’s see the relation of topic proportion across years.
for(i in 1:length(topic_labels)){
plot(effect1, covariate = "year",
     topics = i,
     model = finfit1, method = "continuous",
     # plot topic prevalence as a smooth function of year
     xlab = "Year",
     # xlim = c(0,800),
     # ylim = c(-0.1,0.1),
     main = topic_labels[i],
     printlegend = FALSE,
     custom.labels =topic_labels[i],
     labeltype = "custom")
}

  • From the plots above, year appears unrelated to the topic proportions, as the 95% confidence intervals are wide.

5.4.5 Regression Analysis

  • Let’s prepare the data for the regression analysis.
thetadf <- as.data.frame(finfit1$theta)
colnames(thetadf) <- topic_labels

reginfo <- cbind(result5$meta, thetadf) %>% 
  filter(!year == "2020") 

reginfo$documents <- NULL
  • Model fitting
model1.information <- lm(sp_avg~sent_avg+year+factor(industry),data=reginfo)

# adding average formality
model2.information <- lm(sp_avg~sent_avg+form_avg+year+factor(industry),data=reginfo)

# adding topic proportion T1
model3.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1,data=reginfo)

# adding T2
model4.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2,data=reginfo)

# adding T3
model5.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3,data=reginfo)

# adding T4
model6.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4,data=reginfo)

# adding T5
model7.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4+T5,data=reginfo)

# adding T6
model8.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4+T5+T6,data=reginfo)

# adding T7
model9.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4+T5+T6+T7,data=reginfo)

# adding T8
model10.information <- lm(sp_avg~sent_avg+year+factor(industry)+T1+T2+T3+T4+T5+T6+T7+T8,data=reginfo)
stargazer::stargazer(model1.information,model2.information,model3.information,model4.information,model5.information,model6.information,model7.information, model8.information,model9.information,model10.information,
          single.row = TRUE, # to put coefficients and standard errors on same line
          no.space = TRUE, # to remove the spaces after each line of coefficients
          omit.stat=c("f", "ser"),
          digits = 2,
          column.sep.width = "0pt",
          font.size = "tiny" ,# to make font size smaller
                     type = "text", out = "modelresult.html",
          covariate.labels=c("avg. sentiment","avg. formality","year",
 "DP","IT Consult","Semi Con","Tech Hardware","T1","T2","T3","T4","T5","T6","T7","T8","Intercept"))
  • Considering the adjusted R-squared, the model that best explains variation in the stock price difference is model 7, with an adjusted R-squared of 0.35.
  • Topic 5 has a significant positive relation to the average stock price difference, with a coefficient of 202.52 (p < 0.05, SE = 93.6).
  • The IT Consult, DP, and Semi Con coefficients are also significant, indicating a strong relation to the average price difference. Relative to the App industry, and holding the other variables constant, Semi Con and DP are associated with a higher average price difference, while IT Consult is associated with a lower one.
  • Year does not significantly affect the average stock price difference, consistent with the previous section.
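The nested-model comparison above can be sketched on synthetic data (the variable names mirror reginfo, but the values are simulated): add one predictor at a time and track adjusted R-squared, which penalizes uninformative additions.

```r
# Simulated stand-in for reginfo: sp_avg truly depends on sent_avg and T5
set.seed(123)
n  <- 60
df <- data.frame(sent_avg = rnorm(n), T5 = runif(n))
df$sp_avg <- 2 * df$T5 + 0.5 * df$sent_avg + rnorm(n)

# Nested models, as in the sequence model1 ... model10 above
m1 <- lm(sp_avg ~ sent_avg, data = df)
m2 <- lm(sp_avg ~ sent_avg + T5, data = df)

adj <- sapply(list(m1, m2), function(m) summary(m)$adj.r.squared)
# Adding the informative topic proportion improves adjusted R-squared
```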
# all packages used in this project
subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion))
##                                   package loadedversion
## BatchGetSymbols           BatchGetSymbols         2.6.1
## broom                               broom         0.7.2
## caret                               caret        6.0-86
## cld3                                 cld3         1.4.1
## cluster                           cluster         2.1.0
## corrr                               corrr         0.4.3
## dplyr                               dplyr         1.0.2
## edgar                               edgar         2.0.3
## edgarWebR                       edgarWebR         1.1.0
## forcats                           forcats         0.5.0
## Formula                           Formula         1.2-4
## geometry                         geometry         0.4.5
## ggmap                               ggmap         3.0.0
## ggplot2                           ggplot2         3.3.2
## ggraph                             ggraph         2.0.5
## gridExtra                       gridExtra           2.3
## Hmisc                               Hmisc         4.5-0
## hunspell                         hunspell         3.0.1
## igraph                             igraph         1.2.6
## KernSmooth                     KernSmooth       2.23-17
## knitr                               knitr          1.30
## lattice                           lattice       0.20-41
## lda                                   lda         1.4.2
## lmtest                             lmtest        0.9-38
## lubridate                       lubridate         1.7.9
## magrittr                         magrittr           1.5
## matrixStats                   matrixStats        0.57.0
## mctest                             mctest         1.3.1
## mgcv                                 mgcv        1.8-33
## NCmisc                             NCmisc         1.1.6
## nlme                                 nlme       3.1-149
## NLP                                   NLP         0.2-1
## PerformanceAnalytics PerformanceAnalytics         2.0.4
## plm                                   plm         2.4-1
## plotly                             plotly       4.9.2.1
## purrr                               purrr         0.3.4
## qdap                                 qdap         2.4.3
## qdapDictionaries         qdapDictionaries         1.0.7
## qdapRegex                       qdapRegex         0.7.2
## qdapTools                       qdapTools         1.3.5
## quantmod                         quantmod        0.4.17
## RColorBrewer                 RColorBrewer         1.1-2
## Rcpp                                 Rcpp         1.0.5
## RcppArmadillo               RcppArmadillo    0.10.1.0.0
## readr                               readr         1.4.0
## readtext                         readtext          0.80
## rsvd                                 rsvd         1.0.5
## Rtsne                               Rtsne          0.15
## rvest                               rvest         0.3.6
## scales                             scales         1.1.1
## SentimentAnalysis       SentimentAnalysis         1.3-4
## sentimentr                     sentimentr         2.7.1
## slam                                 slam        0.1-48
## SnowballC                       SnowballC         0.7.0
## stm                                   stm         1.3.6
## stringi                           stringi         1.5.3
## stringr                           stringr         1.4.0
## survival                         survival         3.2-7
## textcat                           textcat         1.0-7
## textshape                       textshape         1.7.1
## tibble                             tibble         3.0.4
## tidyquant                       tidyquant         1.0.3
## tidyr                               tidyr         1.1.2
## tidytext                         tidytext         0.3.1
## tidyverse                       tidyverse         1.3.0
## timeDate                         timeDate      3043.102
## tm                                     tm         0.7-8
## topicmodels                   topicmodels        0.2-12
## TTR                                   TTR        0.24.2
## udpipe                             udpipe         0.8.5
## widyr                               widyr         0.1.3
## wordcloud                       wordcloud           2.6
## xml2                                 xml2         1.3.2
## xts                                   xts        0.12.1
## zoo                                   zoo         1.8-8
